Crate human_regex

Regex for Humans

The goal of this crate is simple: give everybody the power of regular expressions without having to learn the complicated syntax. It is inspired by ReadableRegex.jl. This crate is a wrapper around the core Rust regex library.

Example usage

If you want to match a date of the format 2021-10-30, you could use the following code to generate a regex:

use human_regex::{beginning, digit, exactly, text, end};
let regex_string = beginning()
    + exactly(4, digit())
    + text("-")
    + exactly(2, digit())
    + text("-")
    + exactly(2, digit())
    + end();
assert!(regex_string.to_regex().is_match("2014-01-01"));

The to_regex() method returns a standard Rust regex. We can do this another way with slightly less repetition though!

use human_regex::{beginning, digit, exactly, text, end};
let first_regex_string = text("-") + exactly(2, digit());
let second_regex_string = beginning()
    + exactly(4, digit())
    + exactly(2, first_regex_string)
    + end();
assert!(second_regex_string.to_regex().is_match("2014-01-01"));
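To make the composition idea concrete, here is a minimal, std-only sketch of how `+` and `exactly` could assemble a regex string. The `Pattern` type and the helper functions are hypothetical stand-ins written for illustration. They are not the crate's actual `HumanRegex` implementation, and the exact string that human_regex generates may differ.

```rust
use std::ops::Add;

/// Hypothetical stand-in for a regex builder type (NOT the crate's `HumanRegex`).
#[derive(Debug, Clone, PartialEq)]
struct Pattern(String);

impl Add for Pattern {
    type Output = Pattern;

    // Concatenation: `x + y` becomes the regex `xy`.
    fn add(self, rhs: Pattern) -> Pattern {
        Pattern(self.0 + &rhs.0)
    }
}

fn beginning() -> Pattern {
    Pattern("^".to_string())
}

fn end() -> Pattern {
    Pattern("$".to_string())
}

fn digit() -> Pattern {
    Pattern(r"\d".to_string())
}

fn dash() -> Pattern {
    Pattern("-".to_string())
}

/// Repeat `target` exactly `n` times. The target is wrapped in a
/// non-capturing group so the quantifier applies to all of it.
fn exactly(n: u32, target: Pattern) -> Pattern {
    Pattern(format!("(?:{}){{{}}}", target.0, n))
}

/// Same shape as the second date example: a year, then two "-dd" parts.
fn date_pattern() -> Pattern {
    let dash_and_two_digits = dash() + exactly(2, digit());
    beginning() + exactly(4, digit()) + exactly(2, dash_and_two_digits) + end()
}

fn main() {
    // prints ^(?:\d){4}(?:-(?:\d){2}){2}$
    println!("{}", date_pattern().0);
}
```

The point of the sketch is that every helper returns a value that can be combined again, which is what lets a sub-pattern be built once and reused inside a larger one.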

For a more extensive set of examples, please see The Cookbook.

Features

This crate currently supports the vast majority of syntax available in the core Rust regex library through a human-readable API.

Single Character

Implemented?	Expression	Description
`any()`	`.`	any character except new line (includes new line with s flag)
`digit()`	`\d`	digit (`\p{Nd}`)
`non_digit()`	`\D`	not digit
`unicode_category(UnicodeCategory)`	`\p{L}`	Unicode non-script category
`unicode_script(UnicodeScript)`	`\p{Greek}`	Unicode script category
`non_unicode_category(UnicodeCategory)`	`\P{L}`	Negated one-letter name Unicode character class
`non_unicode_script(UnicodeScript)`	`\P{Greek}`	Negated Unicode script category

Character Classes

Implemented?	Expression	Description
`or(&['x', 'y', 'z'])`	`[xyz]`	A character class matching either x, y or z (union).
`nor(&['x', 'y', 'z'])`	`[^xyz]`	A character class matching any character except x, y and z.
`within('a'..='z')`	`[a-z]`	A character class matching any character in range a-z.
`without('a'..='z')`	`[^a-z]`	A character class matching any character outside range a-z.
See below	`[[:alpha:]]`	ASCII character class (`[A-Za-z]`)
`non_alphabetic()`	`[[:^alpha:]]`	Negated ASCII character class (`[^A-Za-z]`)
`or()`	`[x[^xyz]]`	Nested/grouping character class (matching any character except y and z)
`and(&[])`/`&`	`[a-y&&xyz]`	Intersection (a-y AND xyz = xy)
`(or[1,2,3,4] & nor(3))`	`[0-9&&[^4]]`	Subtraction using intersection and negation (matching 0-9 except 4)
`subtract(&[],&[])`	`[0-9--4]`	Direct subtraction (matching 0-9 except 4). Use .collect::<Vec> to use ranges.
`xor(&[],&[])`	`[a-g~~b-h]`	Symmetric difference (matching `a` and `h` only). Requires .collect() for ranges.
`or(&escape_all(&['[',']']))`	`[\[\]]`	Escaping in character classes (matching `[` or `]`)
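The set operations in the table above (intersection, subtraction, symmetric difference) can be checked with ordinary sets of characters. The following sketch uses only `std::collections::BTreeSet` and does not call human_regex. The helper names `range` and `set_of` are made up for this example.

```rust
use std::collections::BTreeSet;

/// Collect an inclusive character range such as a-y into a set.
fn range(lo: char, hi: char) -> BTreeSet<char> {
    (lo..=hi).collect()
}

/// Collect the individual characters of `s` into a set.
fn set_of(s: &str) -> BTreeSet<char> {
    s.chars().collect()
}

/// Models `[a-y&&xyz]`: characters in a-y AND in {x, y, z}.
fn intersection_example() -> String {
    let result: String = range('a', 'y').intersection(&set_of("xyz")).collect();
    result
}

/// Models `[0-9--4]`: the digits 0-9 with 4 removed.
fn subtraction_example() -> String {
    let result: String = range('0', '9').difference(&set_of("4")).collect();
    result
}

/// Models `[a-g~~b-h]`: characters in exactly one of the two ranges.
fn symmetric_difference_example() -> String {
    let result: String = range('a', 'g')
        .symmetric_difference(&range('b', 'h'))
        .collect();
    result
}

fn main() {
    println!("{}", intersection_example()); // prints xy
    println!("{}", subtraction_example()); // prints 012356789
    println!("{}", symmetric_difference_example()); // prints ah
}
```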

Perl Character Classes

Implemented?	Expression	Description
`digit()`	`\d`	digit (`\p{Nd}`)
`non_digit()`	`\D`	not digit
`whitespace()`	`\s`	whitespace (`\p{White_Space}`)
`non_whitespace()`	`\S`	not whitespace
`word()`	`\w`	word character (`\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control}`)
`non_word()`	`\W`	not word character

ASCII Character Classes

Implemented?	Expression	Description
`alphanumeric()`	`[[:alnum:]]`	alphanumeric (`[0-9A-Za-z]`)
`alphabetic()`	`[[:alpha:]]`	alphabetic (`[A-Za-z]`)
`ascii()`	`[[:ascii:]]`	ASCII (`[\x00-\x7F]`)
`blank()`	`[[:blank:]]`	blank (`[\t ]`)
`control()`	`[[:cntrl:]]`	control (`[\x00-\x1F\x7F]`)
`digit()`	`[[:digit:]]`	digits (`[0-9]`)
`graphical()`	`[[:graph:]]`	graphical (`[!-~]`)
`lowercase()`	`[[:lower:]]`	lower case (`[a-z]`)
`printable()`	`[[:print:]]`	printable (`[ -~]`)
`punctuation()`	`[[:punct:]]`	punctuation ([!-/:-@\[-`{-~])
`whitespace()`	`[[:space:]]`	whitespace (`[\t\n\v\f\r ]`)
`uppercase()`	`[[:upper:]]`	upper case (`[A-Z]`)
`word()`	`[[:word:]]`	word characters (`[0-9A-Za-z_]`)
`hexdigit()`	`[[:xdigit:]]`	hex digit (`[0-9A-Fa-f]`)
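Several of the ASCII classes above have direct counterparts among the `char::is_ascii_*` methods in Rust's standard library. The sketch below spells out a few ranges from the table as plain predicates and compares them with the standard methods over all 128 ASCII characters. The predicate names (`is_punct`, `is_graph`, and so on) are invented for this example and are not part of human_regex.

```rust
/// `[[:punct:]]` written out from the ranges in the table.
fn is_punct(c: char) -> bool {
    matches!(c, '!'..='/' | ':'..='@' | '['..='`' | '{'..='~')
}

/// `[[:graph:]]`, i.e. `[!-~]`.
fn is_graph(c: char) -> bool {
    matches!(c, '!'..='~')
}

/// `[[:cntrl:]]`, i.e. `[\x00-\x1F\x7F]`.
fn is_cntrl(c: char) -> bool {
    matches!(c, '\x00'..='\x1F' | '\x7F')
}

/// `[[:xdigit:]]`, i.e. `[0-9A-Fa-f]`.
fn is_xdigit(c: char) -> bool {
    matches!(c, '0'..='9' | 'A'..='F' | 'a'..='f')
}

/// `[[:space:]]`, i.e. `[\t\n\v\f\r ]` (`\v` is 0x0B, `\f` is 0x0C).
fn is_space(c: char) -> bool {
    matches!(c, '\t' | '\n' | '\x0B' | '\x0C' | '\r' | ' ')
}

/// True if both predicates give the same answer for every ASCII character.
fn agrees_on_ascii(table: fn(char) -> bool, std_method: fn(&char) -> bool) -> bool {
    (0u8..=127).map(char::from).all(|c| table(c) == std_method(&c))
}

fn main() {
    println!("punct:  {}", agrees_on_ascii(is_punct, char::is_ascii_punctuation));
    println!("graph:  {}", agrees_on_ascii(is_graph, char::is_ascii_graphic));
    println!("cntrl:  {}", agrees_on_ascii(is_cntrl, char::is_ascii_control));
    println!("xdigit: {}", agrees_on_ascii(is_xdigit, char::is_ascii_hexdigit));
    // `is_ascii_whitespace` leaves out vertical tab (0x0B), so it is
    // narrower than `[[:space:]]`.
    println!("space:  {}", agrees_on_ascii(is_space, char::is_ascii_whitespace));
}
```

One caveat worth knowing: `char::is_ascii_whitespace` excludes the vertical tab, so it is not an exact substitute for `[[:space:]]`.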

Repetitions

Implemented?	Expression	Description
`zero_or_more(x)`	`x*`	zero or more of x (greedy)
`one_or_more(x)`	`x+`	one or more of x (greedy)
`zero_or_one(x)`	`x?`	zero or one of x (greedy)
`zero_or_more(x).lazy()`	`x*?`	zero or more of x (ungreedy/lazy)
`one_or_more(x).lazy()`	`x+?`	one or more of x (ungreedy/lazy)
`zero_or_one(x).lazy()`	`x??`	zero or one of x (ungreedy/lazy)
`between(n, m, x)`	`x{n,m}`	at least n x and at most m x (greedy)
`at_least(n, x)`	`x{n,}`	at least n x (greedy)
`exactly(n, x)`	`x{n}`	exactly n x
`between(n, m, x).lazy()`	`x{n,m}?`	at least n x and at most m x (ungreedy/lazy)
`at_least(n, x).lazy()`	`x{n,}?`	at least n x (ungreedy/lazy)
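The difference between the greedy and lazy rows is easiest to see on a concrete input. The sketch below emulates the two patterns `<.+>` (greedy) and `<.+?>` (lazy) by hand with string searching, so it needs neither human_regex nor a regex engine. The function names are invented for this example, and the input is assumed to contain no newlines (which `.` would not match by default).

```rust
/// Byte offsets (relative to the text after the opening '<') of every '>'
/// that leaves at least one character for `.+` to consume.
fn closing_candidates(rest: &str) -> impl Iterator<Item = usize> + '_ {
    rest.char_indices()
        .filter(|&(i, c)| c == '>' && i > 0)
        .map(|(i, _)| i)
}

/// Emulates `<.+>`: start at the first '<' and extend to the LAST usable '>'.
fn greedy_tag(haystack: &str) -> Option<&str> {
    let start = haystack.find('<')?;
    let rest = &haystack[start + 1..];
    let close = closing_candidates(rest).last()?;
    // +1 for the opening '<', +1 to include the closing '>'.
    Some(&haystack[start..start + 1 + close + 1])
}

/// Emulates `<.+?>`: start at the first '<' and stop at the FIRST usable '>'.
fn lazy_tag(haystack: &str) -> Option<&str> {
    let start = haystack.find('<')?;
    let rest = &haystack[start + 1..];
    let close = closing_candidates(rest).next()?;
    Some(&haystack[start..start + 1 + close + 1])
}

fn main() {
    let input = "<a><b>";
    println!("{:?}", greedy_tag(input)); // prints Some("<a><b>")
    println!("{:?}", lazy_tag(input)); // prints Some("<a>")
}
```

Greedy repetition takes as much as it can while still allowing the rest of the pattern to match. Lazy repetition takes as little as it can.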

Composites

Implemented?	Expression	Description
`+`	`xy`	concatenation (x followed by y)
`or()`	`x|y`	alternation (x or y, prefer x)

Empty matches

Implemented?	Expression	Description
`beginning()`	`^`	the beginning of text (or start-of-line with multi-line mode)
`end()`	`$`	the end of text (or end-of-line with multi-line mode)
`beginning_of_text()`	`\A`	only the beginning of text (even with multi-line mode enabled)
`end_of_text()`	`\z`	only the end of text (even with multi-line mode enabled)
`word_boundary()`	`\b`	a Unicode word boundary (\w on one side and \W, \A, or \z on other)
`non_word_boundary()`	`\B`	not a Unicode word boundary

Groupings

Implemented?	Expression	Description
`capture(exp)`	`(exp)`	numbered capture group (indexed by opening parenthesis)
`named_capture(exp, name)`	`(?P<name>exp)`	named (also numbered) capture group
Handled implicitly through functional composition	`(?:exp)`	non-capturing group
See below	`(?flags)`	set flags within current group
See below	`(?flags:exp)`	set flags for exp (non-capturing)

Flags

Implemented?	Expression	Description
`case_insensitive(exp)`	`i`	case-insensitive: letters match both upper and lower case
`multi_line_mode(exp)`	`m`	multi-line mode: `^` and `$` match begin/end of line
`dot_matches_newline_too(exp)`	`s`	allow `.` to match `\n`
will not be implemented¹	`U`	swap the meaning of `x` and `x?`
`disable_unicode(exp)`	`u`	Unicode support (enabled by default)
will not be implemented²	`x`	ignore whitespace and allow line comments (starting with `#`)

¹ With the declarative nature of this library, use of this flag would just obfuscate meaning.
² When using human_regex, comments should be added in source code rather than in the regex string.
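The `(?flags:exp)` form from the Groupings table is the syntax that scopes a flag to a single expression. As a rough sketch of that syntax, the hypothetical helper below builds such a group from plain strings. It is not part of human_regex, and whether the crate's flag functions emit exactly this form is an assumption.

```rust
/// Hypothetical helper (not a human_regex function): wrap `exp` in a
/// non-capturing group that enables the flags in `enable` and disables
/// the flags in `disable`, e.g. `(?i:exp)` or `(?-u:exp)`.
fn with_flags(enable: &str, disable: &str, exp: &str) -> String {
    if disable.is_empty() {
        format!("(?{}:{})", enable, exp)
    } else {
        format!("(?{}-{}:{})", enable, disable, exp)
    }
}

fn main() {
    // Case-insensitive match of "abc".
    println!("{}", with_flags("i", "", "abc")); // prints (?i:abc)
    // Unicode support switched off for `\w`.
    println!("{}", with_flags("", "u", r"\w")); // prints (?-u:\w)
    // Several flags at once: case-insensitive and dot-matches-newline.
    println!("{}", with_flags("is", "", "a.b")); // prints (?is:a.b)
}
```

With both flag lists empty the helper produces `(?:exp)`, the plain non-capturing group.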

Modules

ascii

Functions for ASCII character classes

capturing

Functions for capturing matches

cookbook

A Cookbook of Common Tasks

direct

Functions for directly matching text or adding known regex strings

emptymatches

Functions for the empty matches

flags

Functions for adding flags

logical

Functions for performing logical operations

repetitions

Functions for matching repetitions

shorthand

Functions for general purpose matches

Structs

HumanRegex

The HumanRegex struct which maintains and updates the regex string. For most use cases it will never be necessary to instantiate this directly.

Enums

UnicodeCategory

An enum covering all Unicode character categories

UnicodeScript

An enum covering all Unicode script categories

Functions

alphabetic

A function to match any alphabetic character ([A-Za-z])

alphanumeric

A function to match any alphanumeric character ([0-9A-Za-z])

and

A function for establishing an AND relationship between two or more possible matches

any

A function for matching any character (except for \n)

ascii

A function to match any ASCII character ([\x00-\x7F])

at_least

Match at least n of a certain target

beginning

A function to match the beginning of text (or start-of-line with multi-line mode)

beginning_of_text

A function to match the beginning of text (even with multi-line mode enabled)

between

Match at least n and at most m of a certain target

blank

A function to match blank characters ([\t ])

capture

Add a numbered capturing group around an expression

case_insensitive

Makes all matches case insensitive, matching both upper and lowercase letters.

control

A function to match control characters ([\x00-\x1F\x7F])

digit

A function for the digit character class (i.e., the digits 0 through 9)

disable_unicode

A function to disable unicode support

dot_matches_newline_too

A function that will allow . to match newlines (\n)

end

A function to match the end of text (or end-of-line with multi-line mode)

end_of_text

A function to match the end of text (even with multi-line mode enabled)

escape_all

Escapes an entire list for use in something like an or() or an and() expression.

exactly

Match exactly n of a certain target

graphical

A function to match graphical characters ([!-~])

hexdigit

A function to match any digit that would appear in a hexadecimal number ([A-Fa-f0-9])

lowercase

A function to match any lowercase character ([a-z])

multi_line_mode

Enables multiline mode, which will allow beginning() and end() to match the beginning and end of lines

named_capture

Add a named capturing group around an expression

non_alphabetic

A function to match any non-alphabetic character ([^A-Za-z])

non_alphanumeric

A function to match any non-alphanumeric character ([^0-9A-Za-z])

non_ascii

A function to match any non-ASCII character ([^\x00-\x7F])

non_blank

A function to match non-blank characters ([^\t ])

non_control

A function to match non-control characters ([^\x00-\x1F\x7F])

non_digit

A function for the non-digit character class (i.e., everything BUT the digits 0-9)

non_graphical

A function to match non-graphical characters ([^!-~])

non_hexdigit

A function to match any character that is not a hexadecimal digit ([^A-Fa-f0-9])

non_lowercase

A function to match any non-lowercase character ([^a-z])

non_printable

A function to match unprintable characters ([^ -~])

non_punctuation

A function to match non-punctuation ([^!-/:-@\[-`{-~])

non_unicode_category

A function for not matching Unicode character categories. For matching script categories see non_unicode_script.

non_unicode_script

A function for matching Unicode characters not belonging to a certain script category. For matching other categories see non_unicode_category.

non_uppercase

A function to match any non-uppercase character ([^A-Z])

non_whitespace

A function for the non-whitespace character class (i.e., everything BUT whitespace characters such as space and tab)

non_word

A function for the non-word character class (i.e., everything BUT the alphanumeric characters plus underscore)

non_word_boundary

A function to match anything BUT a word boundary

nonescaped_text

This text is not escaped. You can use it, for instance, to add a regex string directly to the object.

nor

Negated or relationship between two or more possible matches

one_or_more

Match one or more of a certain target

or

A function for establishing an OR relationship between two or more possible matches

printable

A function to match printable characters ([ -~])

punctuation

A function to match punctuation ([!-/:-@\[-`{-~])

subtract

Subtracts the second argument from the first

text

Add matching text to the regex string. Text that is added through this function is automatically escaped.

unicode_category

A function for matching Unicode character categories. For matching script categories see unicode_script.

unicode_script

A function for matching Unicode characters belonging to a certain script category. For matching other categories see unicode_category.

uppercase

A function to match any uppercase character ([A-Z])

whitespace

A function for the whitespace character class (e.g., space, tab, and newline)

within

Matches anything within a range of characters

without

Matches anything outside of a range of characters

word

A function for the word character class (i.e., all alphanumeric characters plus underscore)

word_boundary

A function to match a word boundary

xor

Xor on two bracketed expressions, also known as symmetric difference.

zero_or_more

Match zero or more of a certain target

zero_or_one

Match zero or one of a certain target
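The `text` and `nonescaped_text` entries above differ only in whether regex metacharacters are escaped. As an illustration of what escaping means, here is a small std-only sketch. The `escape` and `is_meta` functions are written for this example and are not the crate's code, and the list of metacharacters is an assumption modeled on the characters the Rust regex syntax treats as special.

```rust
/// Assumed set of characters that need a backslash to be matched literally.
fn is_meta(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$'
            | '#' | '&' | '-' | '~'
    )
}

/// Put a backslash in front of every metacharacter so that the result
/// matches `text` literally when used as a regex.
fn escape(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if is_meta(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

fn main() {
    println!("{}", escape("1+1=2")); // prints 1\+1=2
    println!("{}", escape("a.b")); // prints a\.b
    println!("{}", escape("plain")); // prints plain
}
```

Without escaping, an input such as `a.b` would also match `axb`, because `.` matches any character. That is the reason to prefer the escaped form for literal text and to reserve the unescaped form for strings that are already valid regex.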